Abstract. We study the non-overlapping indexing problem: given a
text T, preprocess it in order to answer queries of the form: given a
pattern P, report the maximal set of non-overlapping occurrences of P in
T. A generalization of this problem is range non-overlapping indexing,
where in addition we are given two indexes i, j and must report the maximal
set of non-overlapping occurrences between these two indexes. We suggest
new solutions for these problems. For the non-overlapping problem our
solution uses O(n) space with query time O(m + occ_NO). For the range
non-overlapping problem we propose a solution with O(n log^ε n) space
for some 0 < ε < 1 and O(m + log log n + occ_{i,j,NO}) query time.



1 Introduction and Related Work 

Given a text T of length n over an alphabet Σ, the text indexing problem is to
build an index on T which can answer pattern matching queries efficiently: given
a pattern P of length m, we want to report all its occurrences in T. There are
some known solutions for this problem. One example is the suffix tree, proposed
by Weiner [1], which is a compacted trie storing all suffixes of the text. A
suffix tree for a text T of length n requires O(n) space and can be built in
O(n) preprocessing time. It has query time of O(m + occ), where occ is the
number of occurrences of P in T.

Range text indexing, also known as position restricted substring searching, is
the problem of finding a pattern P in a substring of the text T between two given
positions. A solution for this problem was presented by Mäkinen and Navarro
[2]. It uses O(n log^ε n) space and has query time of O(m + log log n + occ). Their
solution is based on another problem, the range searching problem.

The range searching problem is to preprocess a set of points in a d-dimensional
space for answering queries about the set of points which are contained within
a specific range. Alstrup et al. [3] proposed a solution for the orthogonal
two-dimensional range searching problem when all the points are from an n × n
grid, which costs O(n log^ε n) space and has O(log log n + k) query time, where
k is the number of points inside the range. Grossi and Iwona in [4] have shown
how to use Alstrup's data structure to get all the points inside a particular
range in a specific order using some kind of rank function.

In text indexing we are sometimes interested in reporting only the
non-overlapping occurrences of P in T. Such interest arises in fields such as
pattern recognition, computational linguistics, speech recognition, data
compression, etc. For instance, we might want to compress a text by replacing
each non-overlapping occurrence of a substring of it with a pointer to a single
copy of the substring.

Another problem is the string statistics problem [5, 6], which consists of
preprocessing a text T such that when given a query pattern P, the maximum
number of non-overlapping occurrences of P in T can be reported efficiently.
However, in the string statistics problem we only return the number of
non-overlapping occurrences, not the actual occurrences. In this paper, we
present the first non-trivial solution for the non-overlapping indexing problem,
where we want to report the maximal sequence of non-overlapping occurrences of
P in T.

Keller et al. [7] proposed a solution for a generalization of this problem
called range non-overlapping indexing, where we want to report the
non-overlapping occurrences in a substring of T. Their solution has query time
of O(m + occ_{i,j,NO} log log n) and uses O(n log n) space, where occ_{i,j,NO}
is the number of the maximal non-overlapping occurrences in the substring
T[i : j].

Crochemore et al. [8] suggested another solution for the range non-overlapping
indexing problem. Their solution has optimal query time of O(m + occ_{i,j,NO})
but requires O(n^{1+ε}) space.

In this paper, we present new solutions for the non-overlapping indexing
problem, which use the periodicity of the text and pattern in order to minimize
the query time. Our solution for the non-overlapping indexing problem uses O(n)
space with optimal query time of O(m + occ_NO). For the range non-overlapping
indexing problem we present a solution with O(n log^ε n) space for some
0 < ε < 1 and O(m + log log n + occ_{i,j,NO}) query time.

2 Preliminaries 

Let n be the length of the text T and let m be the length of the pattern P. For
two integers i, j (i < j), T[i : j] is the substring of T from position i to
position j.

We will use the Suffix Tree as our main data structure. Each leaf in the Suffix 
Tree represents a suffix in the text. In each leaf we save two values: y - the start 
location of its suffix in the text and x - the location of the leaf in a left to right 
order of all the leaves of the Suffix Tree (lexicographic order, for example). We 
have two orders on the leaves and therefore on the suffixes as well: x-order and 
y-order. The y-order is the text order and the x-order is the suffix tree leaves 
order. 

When we search for a pattern P in a Suffix Tree ST of T, we finish searching
at some node v. The subtree rooted at v has all the occurrences of P in T in its
leaves. We denote by l and r the x-value of the leftmost leaf and the x-value of
the rightmost leaf of that subtree respectively. Therefore, the occurrences of P
in T are all the leaves with x-value between l and r.
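
To illustrate the two orders, the following sketch uses a naively sorted list of suffixes as a stand-in for the leaves of the Suffix Tree. This is an assumption made for brevity: it reproduces the x-order, the y-values and the range [l, r], but not the O(n) construction or the O(m) search of a real suffix tree. The function names `build_leaves` and `pattern_range` are hypothetical.

```python
import bisect


def build_leaves(text):
    """Return the y-values (1-based suffix starts) listed in x-order,
    i.e. in lexicographic order of the suffixes."""
    n = len(text)
    return sorted(range(1, n + 1), key=lambda y: text[y - 1:])


def pattern_range(text, leaves, pattern):
    """Return (l, r), the 1-based x-range of the leaves whose suffix starts
    with pattern, or None if the pattern does not occur."""
    m = len(pattern)
    # The length-m prefixes of lexicographically sorted suffixes are sorted too.
    prefixes = [text[y - 1:y - 1 + m] for y in leaves]
    lo = bisect.bisect_left(prefixes, pattern)
    hi = bisect.bisect_right(prefixes, pattern)
    if lo == hi:
        return None
    return lo + 1, hi


leaves = build_leaves("banana")
print(leaves)                                  # y-values in x-order
print(pattern_range("banana", leaves, "ana"))  # x-range of the occurrences
```

For the text "banana" and the pattern "ana", the leaves in the returned x-range carry the y-values 4 and 2, which are exactly the starting positions of the pattern in the text.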



We denote the number of occurrences of P in T by occ. The number of
non-overlapping occurrences will be denoted by occ_NO. The number of occurrences
of P in T[i : j] and the number of non-overlapping occurrences of P in T[i : j]
will be denoted by occ_{i,j} and occ_{i,j,NO} respectively.

3 A Solution for Non-Overlapping Indexing 

We use a new approach for solving this problem. Our solution uses the periodicity
of the text and the pattern. We divide patterns into two types: periodic and
aperiodic. A different strategy is used for each type.

Definition 1. A pattern that can appear more than twice overlapping is called a 
periodic pattern. A pattern that can appear at most twice overlapping is called 
an aperiodic pattern. 

3.1 Aperiodic Pattern 

In the aperiodic case we use the periodicity of the pattern to answer a query.
We use the familiar Suffix Tree to get all the leaves that correspond to the
given pattern. After we have all the leaves, we need to remove the overlapping
occurrences. This can be done by sorting the leaves in y-order, going over the
sorted list and filtering the overlapping occurrences. However, sorting occ items
costs O(occ log occ), which is greater than the optimal O(occ). To avoid this
sorting cost we use the following theorem.

Theorem 1. All occurrences of a pattern can be reported and sorted in text
order in O(m + occ) time using O(n) space.

Proof. We use a Suffix Tree to get all the occurrences in O(m + occ) time. For
sorting all the occurrences we will use a renaming method on the Suffix Tree.

Each leaf has its location, i.e., its y-value index, in the whole tree. Saving this
location for a leaf costs log n bits because the whole tree has n leaves. Hence,
the domain for the location is n. Nevertheless, we are interested in the order of
the leaves in the subtree of the occurrences and not in the whole tree. Thus, we
would like to save the location of a leaf within a subtree with fewer leaves. If
we save for each leaf its location in a subtree with fewer leaves, for example
√n leaves, it will cost us only log √n bits. Therefore, for each leaf, aside from
keeping its location in the whole suffix tree, we save its location in a subtree
of size √n, its location in a subtree of size n^(1/4), and so on for all subtrees
of size n^(1/2^i) for i ≥ 1 until we reach a constant size.

We use Radix Sort, which can sort n numbers in a domain of n^2 in O(n) time,
for sorting the leaves by their locations. Given a subtree whose leaves we wish
to sort in y-order, we can sort them by their locations in the subtree of size at
most O(occ^2). This costs only O(occ) using Radix Sort because we sort
occ items in a domain of at most occ^2.
In each leaf we save:

log n bits for its location in the ST.

log √n bits for its location in the subtree of size √n.

log n^(1/4) bits for its location in the subtree of size n^(1/4).

etc.

This sums as follows: log n + log √n + log n^(1/4) + log n^(1/8) + ... =
log n + (1/2) log n + (1/4) log n + (1/8) log n + ... < 2 log n.

Therefore, we save only 2 log n bits per leaf. We have n leaves, summing up
to n · 2 log n = O(n log n) bits, which is O(n) words of space. □
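
The linear-time sorting primitive used in the proof can be sketched as a two-pass least-significant-digit radix sort with base k, which sorts k keys from the domain [0, k^2) in O(k) time. This is only an illustration of the sorting step: the renamed per-leaf locations are assumed to be given, and `radix_sort_square_domain` is a hypothetical name.

```python
def radix_sort_square_domain(keys):
    """Sort k non-negative integers, each smaller than k*k, in O(k) time."""
    k = len(keys)
    if k <= 1:
        return list(keys)
    assert all(0 <= key < k * k for key in keys), "keys must lie in [0, k^2)"

    def bucket_pass(items, digit):
        # One stable pass that distributes the items by a single base-k digit.
        buckets = [[] for _ in range(k)]
        for item in items:
            buckets[digit(item)].append(item)
        return [item for bucket in buckets for item in bucket]

    by_low_digit = bucket_pass(keys, lambda v: v % k)   # least significant digit
    return bucket_pass(by_low_digit, lambda v: v // k)  # most significant digit


print(radix_sort_square_domain([15, 3, 8, 0, 12]))
```

Each pass touches every key once and uses k buckets, so the two passes together run in time linear in the number of keys.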

Theorem 1 provides us with a sorted list of all occurrences in O(m + occ) query
time and O(n) space. Filtering the overlapping occurrences costs O(occ) time,
which completes the query. Because for an aperiodic pattern O(occ) = O(occ_NO),
the query time is O(m + occ_NO).
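
The filtering step can be sketched as a single greedy scan over the occurrences in text order. The sketch assumes the start positions are already sorted (as Theorem 1 provides) and that two occurrences overlap when they share at least one text position; `filter_non_overlapping` is a hypothetical name.

```python
def filter_non_overlapping(sorted_starts, m):
    """Greedy left-to-right scan: keep a start only if it begins after the
    previously kept occurrence of length m has ended."""
    kept = []
    next_free = None  # first position not covered by the last kept occurrence
    for start in sorted_starts:
        if next_free is None or start >= next_free:
            kept.append(start)
            next_free = start + m
    return kept


print(filter_non_overlapping([1, 3, 8], 3))
```

Since all occurrences have the same length m, always keeping the leftmost occurrence that does not overlap the previous choice yields a maximum set of non-overlapping occurrences.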

3.2 Periodic Pattern 

The periodic case is more complex. In this case we use the periodicity of the text 
in order to answer a query. 

Definition 2. A node in the Suffix Tree which represents a suffix which is a
periodic pattern is called a period node.

Definition 3. Let s be a string. We define a period of s to be a string p, such
that s = p^t p', for t ≥ 1, where p' is a prefix of p.
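
As a small illustration of Definition 3, the sketch below tests whether a prefix of a given length is a period of a string and finds the shortest such prefix by a naive scan. The function names are hypothetical, and the quadratic scan is chosen for clarity rather than efficiency.

```python
def is_period(s, p_len):
    """True if the prefix of s of length p_len is a period of s in the sense
    of Definition 3: s = p^t p' with t >= 1 and p' a prefix of p."""
    if not 1 <= p_len <= len(s):
        return False
    # Every character must equal the character one period length earlier.
    return all(s[k] == s[k - p_len] for k in range(p_len, len(s)))


def shortest_period(s):
    """Length of the shortest period of a non-empty string s (naive scan)."""
    return next(p_len for p_len in range(1, len(s) + 1) if is_period(s, p_len))


print(shortest_period("ababab"), shortest_period("abababc"))
```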

Lemma 1. A period node has only one son which can also be a period node. 

Proof. Let a be a period node. Therefore, the string represented by a has a
period p. For a son of a to be a period node too, it must continue the period p.
If a ends with a character c, then the node which has the next character in p
after that c is the period node. There can be only one child of a which
can start with this character. Thus, a period node can have only one son which
is also a period node. □

Note that by the period definition a string can have more than one period,
by taking p_new = pp for example. Nevertheless, each period must overlap with
a shorter period of the same string. Thus, there can't be more than one such
character c.

Definition 4. The path that starts from a period node and goes through all
nodes which continue that period in the Suffix Tree is called a period path. We
define the number of nodes in a period path to be the period path length.

Lemma 2. Let p_2 be a period path on a path PT to the root in the Suffix Tree.
Then p_2 must be at least twice as long as the previous period path p_1 on PT.



Proof. On PT, between p_1 and p_2 there must be at least one node which is
not a period node. For p_2 to be a period path it must represent a periodic
suffix which must start with the period of p_1. Moreover, the periodic suffix of
p_2 should be continued by the character of the next node after p_1 which is not
a period node. After it there must be the period of p_1 again. Therefore, the
length of p_2 is at least twice the length of p_1. □

Lemma 3. The largest number of different period paths contained in the path
from the root to a period node is log n.

Proof. Let PT be the path from the root to some period node in the Suffix Tree.
According to Lemma 2 each period path on PT must be at least twice as long
as the previous period path on PT. Therefore, if we have more than log n period
paths on PT, then the length of the last period path must be greater than n,
which is the text length. Hence, the number of period paths on the same path
can't be more than log n. □

Definition 5. A period sequence is the maximum substring in the text of
some period which is repeated more than twice. We will mark it as [s, e], where
s and e are the start index and the end index of the period sequence respectively.
The period sequences of a periodic pattern are all the period sequences which
start with that periodic pattern. The period length of a period sequence is the
length of the period repeated inside the sequence.

Example 1. Let T be the text "abababcabababcabababc". The period sequences
are: For the period "ab", the period sequences are [1,6], [8,13], [15,20] which have 
period length 2. For the period "abababc", the period sequence is [1,21] which 
has period length 7. 
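
The period sequences of Example 1 can be reproduced with a naive scan. The sketch below is not the paper's construction: it assumes that "repeated more than twice" means at least three full copies of the period, and it treats a start as valid only if the periodic run cannot be extended to the left. The name `period_sequences` is hypothetical.

```python
def period_sequences(text, period):
    """Return the period sequences [s, e] (1-based, inclusive) of the given
    period string in text, found by a naive scan."""
    pl = len(period)
    n = len(text)
    result = []
    for start in range(n - pl + 1):
        if text[start:start + pl] != period:
            continue
        # Skip starts whose run can be extended one position to the left.
        if start > 0 and text[start - 1] == text[start - 1 + pl]:
            continue
        end = start + pl
        # Extend to the right while the text keeps repeating with period pl.
        while end < n and text[end] == text[end - pl]:
            end += 1
        if end - start >= 3 * pl:  # assumed reading of "more than twice"
            result.append((start + 1, end))
    return result


T = "abababcabababcabababc"
print(period_sequences(T, "ab"))
print(period_sequences(T, "abababc"))
```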

Lemma 4. Given a list L of all the period sequences of some periodic pattern
in the text, all the non-overlapping occurrences of that periodic pattern can be
retrieved from this list in O(occ_NO) time.

Proof. For a period sequence [s, e] with period length pl, the non-overlapping
occurrences of a periodic pattern P with length m are the set
{ s + i · step : step = ⌈m / pl⌉ · pl, 0 ≤ i ≤ ⌊(e - s + 1 - m) / step⌋ },
which can be easily calculated.

We report all the occurrences in each period sequence in L. The number of
all occurrences we report is O(occ_NO), so the total time for reporting all the
non-overlapping occurrences from L is O(occ_NO). □
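
The enumeration in the proof of Lemma 4 can be sketched as follows. The step is the smallest multiple of the period length that is at least m, and an occurrence is reported only while it still fits inside [s, e]; this fitting condition is an assumption derived from the definitions, and `occurrences_in_period_sequence` is a hypothetical name.

```python
def occurrences_in_period_sequence(s, e, pl, m):
    """Start positions (1-based) of the non-overlapping occurrences of a
    pattern of length m inside the period sequence [s, e] with period
    length pl."""
    step = -(-m // pl) * pl  # ceil(m / pl) * pl, computed with integers
    positions = []
    pos = s
    while pos + m - 1 <= e:  # the occurrence must end inside the sequence
        positions.append(pos)
        pos += step
    return positions


# Period sequences taken from Example 1.
print(occurrences_in_period_sequence(1, 6, 2, 2))   # pattern "ab"
print(occurrences_in_period_sequence(1, 21, 7, 7))  # pattern "abababc"
```

Every loop iteration reports one occurrence, so the work per period sequence is proportional to the number of occurrences it contributes.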

For answering non-overlapping indexing queries we use the following data
structure. We build a data structure for each period path in the Suffix Tree,
saving a list of all period sequences sorted by their length for each period
path. Each period node in the Suffix Tree is on a period path. We save a pointer
from each period node in the Suffix Tree to the period sequence list of its
period path. This pointer points to the position in the period sequence list
that matches the length associated with the period node.



Theorem 2. Using the data structure described above, all period sequences of a
periodic pattern can be retrieved in O(m + occ_NO) time using O(n log n) space.

Proof. On a query we go to that data structure, get all the period sequences
and by Lemma 4 we calculate all the non-overlapping occurrences. This will
take O(occ_NO) time, because the number of period sequences that we get
is less than the number of non-overlapping occurrences of the pattern.
If the pattern is long, we do not get shorter period sequences which
don't fit the pattern, so we do not get unnecessary period sequences.

The space for this data structure is O(n log n). This is because there are n
nodes where each one can be in at most log n period paths. For each node, we
save all its period paths, so it needs O(n log n) space. □

Now, we will show how to reduce the space needed for this data structure. 

Definition 6. We define the degree of a period sequence to be the maximum
degree of a period sequence included in it plus one. A period sequence without
any period sequences in it has degree 0.

Lemma 5. The maximum degree of a period sequence can be at most log n.

Proof. Let ps be the period sequence with the maximum degree in the text. In
ps there is a period sequence with a degree decreased by one. In that period
sequence there is another period sequence with a degree decreased by one, and
so on until we reach a period sequence ps_0 with degree 0. The length of
each period sequence from ps to ps_0 is at least twice the length of the period
sequence in it. The maximum length of ps can be at most n, therefore, its degree
can be at most log n. □

Lemma 6. There are at most O(n) period sequences.

Proof. We will count the number of period sequences in each degree:

The number of period sequences of degree 0 can be at most n.
The number of period sequences of degree 1 can be at most n/2.
The number of period sequences of degree 2 can be at most n/4.
...
The number of period sequences of degree log n can be at most 1.

In total: n + n/2 + n/4 + ... + 1 < 2n = O(n). □

Theorem 3. The data structure in Theorem 2 can be saved using only O(n) 
space. 

Proof. We save all the period paths in a data structure, each one with its own
period sequences. Each period sequence appears in only one period path. From
Lemma 6 we have O(n) period sequences. Therefore we use at most O(n) space
for all the period sequences. Thus, we need only O(n) space for this data
structure. □

Corollary 1. Using these two different strategies for each type of pattern we can
solve the non-overlapping indexing problem in O(n) space with O(m + occ_NO)
query time.



4 A Solution for Range Non-Overlapping Indexing



We propose a better solution for this problem. Our solution costs O(n log^ε n)
space for some 0 < ε < 1 and has query time of O(m + log log n + occ_{i,j,NO}).

4.1 Rank Sensitive Range Searching 

We use a data structure for answering the two-dimensional orthogonal range
searching problem. Alstrup et al. [3] proposed a data structure for this problem
which requires O(n log^ε n) space with query time of O(log log n + k), where k is
the number of points in the range.

Nevertheless, this range query data structure reports all the points in the
range in no specific order. We want to get those points in a specific order.
Therefore, in addition to this data structure we will use a method suggested by
Grossi et al. [4] for a rank sensitive data structure. This gives us a data
structure which uses O(n log^ε n) space with query time of O(log log n) plus
O(1) per point, where the points are reported in rank order. For simplicity we
will call this data structure RSDS from now on.

4.2 Aperiodic Pattern 

We use a Rank Sensitive Data Structure to answer aperiodic queries. In the
RSDS we store all the occurrences as points by their x value and y value, where
the rank function of a point will be its y value. Given a pattern P and range
i, j we can get l, r from the Suffix Tree, the leftmost leaf and the rightmost
leaf which are occurrences of P. Then we do a range query for points within
[i, j] × [l, r] to get all the correct occurrences. Because the rank in the RSDS
is by y value, we get the points, and therefore the occurrences, sorted in text
order. The only remaining action is filtering the overlapping occurrences.

The RSDS costs O(n log^ε n) space. RSDS query time is O(log log n + k), where
k is the size of the output, which is equal to O(occ_{i,j}). In our case k equals
O(occ_{i,j,NO}) because for an aperiodic pattern O(occ_{i,j}) = O(occ_{i,j,NO}).
Therefore, an aperiodic pattern has query time of O(m) for searching the Suffix
Tree plus O(log log n + occ_{i,j,NO}) for the RSDS query, for a total of
O(m + log log n + occ_{i,j,NO}).
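
The aperiodic range query can be sketched with a brute-force stand-in for the RSDS. The stand-in scans all points and sorts the hits by y, so it does not achieve the stated bounds; it only shows which points are requested and how the result is filtered. Start positions are restricted to [i, j] as in the text, and all function names are hypothetical.

```python
def range_query_sorted_by_y(points, x_lo, x_hi, y_lo, y_hi):
    """Brute-force stand-in for the RSDS: the points inside the rectangle,
    reported in increasing y (text) order."""
    hits = [(x, y) for (x, y) in points
            if x_lo <= x <= x_hi and y_lo <= y <= y_hi]
    return sorted(hits, key=lambda point: point[1])


def aperiodic_range_query(points, l, r, i, j, m):
    """Starts of the non-overlapping occurrences with start position in
    [i, j], for a pattern of length m whose leaf range is [l, r]."""
    result = []
    next_free = i
    for _, y in range_query_sorted_by_y(points, l, r, i, j):
        if y >= next_free:  # does not overlap the last reported occurrence
            result.append(y)
            next_free = y + m
    return result


# Points (x, y) for the text "banana": x is the leaf rank, y the suffix start.
points = [(1, 6), (2, 4), (3, 2), (4, 1), (5, 5), (6, 3)]
print(aperiodic_range_query(points, 2, 3, 1, 6, 3))  # pattern "ana"
```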

4.3 Periodic Pattern 

The periodic case is more complex. We save all period sequences in the text as
points in two RSDS. For a period sequence [s, e] with period length pl, we save
two points (x_1, y_1), (x_2, y_2) with the following values:

x_1 = index of the suffix of s in the left-to-right order of all the ST leaves
y_1 = s
x_2 = index of the suffix of s in the left-to-right order of all the ST leaves
y_2 = e - pl + 1



Point (x_1, y_1) will be saved in the first RSDS with a rank function of x value
in descending order. Point (x_2, y_2) will be saved in the second RSDS with a rank
function of x value in ascending order. Sometimes there will be multiple period
sequences with the same start index or end index, each with a different degree.
When this happens we save only the one with the highest degree. We can easily
convert a period sequence [s, e] to these two points and vice versa.

Following Lemma 6 the number of points in the two RSDS is O(n). Hence,
each RSDS costs O(n log^ε n) space.

Given a pattern P of length m and range i, j we answer using Algorithm 1. 

Get the range l, r from the Suffix Tree;
S ← Query first RSDS for all points within [i, j] × [l, r];
S ← S ∪ Query second RSDS for all points within [i, j] × [l, r];
S ← S ∪ Query first RSDS for the first point within [1, i] × [l, r];
PS ← convertAllPointsToPeriodSequences(S);
PS2 ← ∅;
for ps ∈ PS do
    x ← ps;
    while the period length of x is greater than m do
        x ← the first period sequence inside x;
    end
    PS2 ← PS2 ∪ {x};
end

Algorithm 1: Periodic Pattern Range Query

Getting the first period sequence inside a period sequence can be done by
using another data structure that saves for each period sequence its period
length and a pointer to the first period sequence in it. Thus, given a period
sequence it costs O(1) to find the first period sequence inside it with a degree
decreased by one.
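
The inner loop of Algorithm 1 can be sketched with a small record that stores the period length and the pointer to the first inner period sequence. The class `PeriodSeq`, its field names and the guard against a missing inner sequence are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeriodSeq:
    s: int              # start index in the text
    e: int              # end index in the text
    period_length: int
    first_inner: Optional["PeriodSeq"] = None  # first nested period sequence


def descend_to_pattern(ps, m):
    """Follow first_inner pointers until the period length no longer exceeds
    the pattern length m. Each step costs O(1)."""
    x = ps
    while x.period_length > m and x.first_inner is not None:
        x = x.first_inner
    return x


# Period sequences from Example 1: [1, 6] is nested inside [1, 21].
inner = PeriodSeq(1, 6, 2)
outer = PeriodSeq(1, 21, 7, first_inner=inner)
print(descend_to_pattern(outer, 2))  # reaches the inner sequence [1, 6]
```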

Lemma 7. The number of period sequences we have to go down in order to find 
our appropriate period sequence is lower than the number of occurrences that 
will be extracted from the period sequences inside the first period sequence we 
received. 

Proof. Each degree we go down means that there is another occurrence in the
next period sequence. Each step down adds at least another occurrence.
Therefore, until we get to the appropriate period sequence we do at most O(k)
work, where k is the number of occurrences we will get from the period sequences
inside the first period sequence we encounter. □

By Lemma 7 it does not cost us more time when we get a period sequence 
whose period length is longer than the pattern length. 

Theorem 4. All the non-overlapping occurrences can be calculated from the
period sequences obtained by these three queries.



Proof. First of all we will see that each period sequence we get from these queries
has at least one occurrence of P. Let (x, y) be the point we get. It corresponds
to a period sequence [s, e]. We get only points which have x values between l
and r. The x value of a point is the index of the suffix of s in the left to right
order of all the ST leaves. So if we get a point (x, y) with x between l and r
it means that the corresponding period sequence [s, e] has an occurrence of P.
This happens because all the leaves in the ST between l and r are occurrences
of P.

Now, we need to prove two more things. The first is that all the period
sequences we get from the RSDS are suitable for us and that we haven't got
unnecessary period sequences, which don't fit P in the range. The second
is that we didn't miss any period sequence which can have some suitable
occurrences.

We start by proving that we get all the occurrences of P in the range [i, j]
from the period sequences we get in the three queries. A period sequence of P in
T falls into one of four cases. Let [s, e] be our period sequence.

The first case is that [s, e] is out of the range: s < e < i < j or
i < j < s < e. In this case we would not like to get this period sequence at
all. The first two queries will not return such period sequences because we do
a query on [i, j] × [l, r] but s is out of range and the points corresponding to
this period sequence have y value equal to s. Nevertheless, we can get this
period sequence in the third query. However, it will be at most one period
sequence, which can be checked in O(1) time.

The second case, which is the simplest, is that [s, e] is fully inside the range:
i < s < e < j. In this case we get all the suitable period sequences from
the first query. Nevertheless, we can get the same period sequence twice, first
from the first query and again from the second query. Therefore, we will have to
check any period sequence that we get in order to prevent duplicate occurrence
reporting.

The third case is when only e or s is out of range but not both, s < i < e < j 
or i < s < j < e. This time if i < s < j < e we will get the period sequence from 
the first query. Otherwise, if s < i < e < j we will get the period sequence from 
the second query. 

The fourth case is when the range is fully inside the period sequence
[s, e]: s < i < j < e. In order to solve this case we have the third query, which
returns the last start of a period sequence which is before index i. This period
sequence can be checked in O(1) time. □
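
The four cases of the proof can be summarized by a small classifier on the positions of [s, e] relative to the query range [i, j]. The labels, the name `classify` and the treatment of equal endpoints (counted as lying inside the range) are assumptions of this sketch.

```python
def classify(s, e, i, j):
    """Classify the period sequence [s, e] against the query range [i, j]."""
    if e < i or j < s:
        return "outside"          # first case: no position in common
    if i <= s and e <= j:
        return "inside"           # second case: sequence within the range
    if s < i and j < e:
        return "covers range"     # fourth case: range within the sequence
    return "partial overlap"      # third case: exactly one end out of range


print(classify(1, 21, 8, 13))
```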

Corollary 2. Using these two different strategies for each type of pattern, the
range non-overlapping text indexing problem can be solved in O(n log^ε n) space
for some 0 < ε < 1 and query time of O(m + log log n + occ_{i,j,NO}).

5 Conclusion 

We have studied the problem of non-overlapping indexing. In this paper, we
provide the first non-trivial solution for this problem. In addition we proposed
a better solution for a generalization of this problem, the range non-overlapping
problem.